Learning DNF Expressions from Fourier Spectrum 
Abstract



Since its introduction by Valiant in 1984, PAC learning of DNF expressions has remained one of the
central problems in learning theory. We consider this problem in the setting where the underlying
distribution is uniform, or more generally, a product distribution. Kalai, Samorodnitsky, and Teng
(2009b) showed that in this setting a DNF expression can be efficiently approximated from its
"heavy" low-degree Fourier coefficients alone. This is in contrast to previous approaches where
boosting was used and thus Fourier coefficients of the target function modified by various
distributions were needed. This property is crucial for learning of DNF expressions over smoothed
product distributions, a learning model introduced by Kalai et al. (2009b) and inspired by the
seminal smoothed analysis model of Spielman and Teng (2004).

We introduce a new approach to learning (or approximating) polynomial threshold functions
which is based on creating a function with range $[-1,1]$ that approximately agrees with
the unknown function on low-degree Fourier coefficients. We then describe conditions under
which this is sufficient for learning polynomial threshold functions. As an application of our
approach, we give a new, simple algorithm for approximating any polynomial-size DNF expression
from its "heavy" low-degree Fourier coefficients alone. Our algorithm greatly simplifies the
proof of learnability of DNF expressions over smoothed product distributions and is simpler
than all previous algorithms for PAC learning of DNF expressions using membership queries.
We also describe an application of our algorithm to learning monotone DNF expressions over
product distributions. Building on the work of Servedio (2004), we give an algorithm that
runs in time $\mathrm{poly}((s \cdot \log(s/\epsilon))^{\log(s/\epsilon)}, n)$, where $s$ is the size of the DNF
expression and $\epsilon$ is the accuracy. This improves on the
$\mathrm{poly}((s \cdot \log(ns/\epsilon))^{\log(s/\epsilon)\cdot\log(1/\epsilon)}, n)$ bound of Servedio (2004).
Another advantage of our algorithm is that it can be applied to a large class of polynomial
threshold functions whereas previous algorithms for both applications relied on the function
being a polynomial-size DNF expression.



1 Introduction 



PAC learning of DNF expressions (or formulae) is the problem posed by Valiant (1984) in his seminal
work that introduced the PAC model. The original problem asks whether polynomial-size DNF
expressions are learnable from random examples on points sampled from an unknown distribution.
Despite efforts by numerous researchers, the problem still remains open, with the best algorithm
taking $2^{\tilde{O}(n^{1/3})}$ time (Klivans and Servedio, 2004). In the course of this work, a number of restricted
versions of the problem were introduced and studied. One such assumption is that the distribution
over the domain (which is the $n$-dimensional hypercube $\{-1,1\}^n$) is uniform, or more generally, a
product distribution. In this setting a simple quasi-polynomial $n^{O(\log n)}$ algorithm for learning DNF
expressions was found by Verbeurgt (1990). However, no substantially better algorithms are known
so far even for much simpler classes such as functions of at most $\log n$ variables ($\log n$-juntas).

Another natural restriction commonly considered is monotone DNF (MDNF) expressions, i.e.
those without negated variables. Without restrictions on the distribution, the problem is no easier
than the original one (Kearns et al., 1987) but appears to be easier for product distributions.
Sakai and Maruoka (2000) gave a polynomial-time algorithm for $\log n$-term MDNF learning and
Bshouty and Tamon (1996) gave an algorithm for learning a class of functions which includes
$O(\log n/\log\log n)$-term MDNFs. Most recently, Servedio (2004) proved a substantially stronger
result: $s$-term MDNFs are learnable to accuracy $\epsilon$ in time polynomial in
$(s \cdot \log(ns/\epsilon))^{\log(s/\epsilon)\cdot\log(1/\epsilon)}$ and $n$. In particular, his result implies that
$O(2^{\sqrt{\log n}})$-term MDNFs are learnable in polynomial time to any constant accuracy.
Numerous other restrictions of the original problem were considered.
We refer the interested reader to Servedio's paper (2004) for a more detailed overview.

Several works also considered the problem in the stronger membership query (MQ) model.
In this model the learner can ask for a value of the unknown function at any point in the
domain. Valiant (1984) gave an efficient MQ learning algorithm for MDNFs of polynomial size. In a
celebrated result, Jackson (1997) gave a polynomial-time MQ learning algorithm for DNFs over
product distributions. Jackson's algorithm uses the Fourier transform-based learning technique
(Linial et al., 1993) and combines the Kushilevitz-Mansour algorithm for finding a "heavy" Fourier
coefficient of a boolean function (Goldreich and Levin, 1989; Kushilevitz and Mansour, 1993) with
the Boosting-by-Majority algorithm of Freund (1995). A similar approach was used in the
subsequent improvements to Jackson's algorithm (Klivans and Servedio, 2003; Bshouty et al., 2004;
Feldman, 2007).

The access to membership queries is clearly a very strong assumption and is unrealistic in
most learning applications. Several works give DNF learning algorithms which relax this
requirement: the learning algorithm of Bshouty and Feldman (2002) uses random examples from
product distributions chosen by the algorithm and the algorithm of Bshouty et al. (2005) uses only
examples produced by a random walk on the hypercube. Another approach is to relax the
requirement that the PAC algorithm succeeds on all polynomial-size DNF formulae and require it to
succeed on a randomly chosen expression generated from some simple distribution over the formulae
(Aizenstein and Pitt, 1995). Strong results of this form were achieved recently by Jackson et al.
(2011) and Sellie.



A new way to avoid the worst-case hardness of learning DNF was recently proposed by Kalai et al.
(2009b). Their model is inspired by the seminal model of smoothed analysis introduced in the
context of optimization and numerical analysis by Spielman and Teng (2004). Smoothed analysis is
based on the insight that, in practice, real-valued inputs or parameters of the problem are a result
of noisy and imprecise measurements. Therefore the complexity of a problem is measured not on
the worst-case values but on a random perturbation of those values. In the work of Kalai et al.
(2009b) the perturbed parameters are the expectations of each of the coordinates of a product
distribution over $\{-1,1\}^n$. In a surprising result they showed that DNF formulae are learnable
efficiently in this model (and that decision trees are even learnable agnostically).

A crucial and the most involved component of the DNF learning algorithm of Kalai et al.
(2009b) is the algorithm that, given all "heavy" (here this refers to those of inverse-polynomial
magnitude), low-degree (logarithmic in the learning parameters) Fourier coefficients of the target
DNF $f$ to inverse-polynomial accuracy, finds a function that is $\epsilon$-close to $f$. Such an algorithm
is necessary since, in the boosting-based approach of Jackson (1997), the weak learner needs to
learn with respect to distributions which depend on previous weak hypotheses. When learning over
a smoothed product distribution, the first weak hypothesis depends on the specific perturbation
and therefore in the subsequent boosting stages, the parameters of the product distribution can
no longer be thought of as perturbed randomly. Kalai et al. (2009b) show that this is not only a
matter of complications in the analysis but an actual limitation of the boosting-based approach.
Therefore they used an algorithm that first collects all the "heavy" low-degree Fourier coefficients
and then relies solely on this information to approximate the target function.

1.1 Our Results 

We describe a new approach to the problem of learning a polynomial threshold function (PTF) from
approximations of its "heavy" low-degree Fourier coefficients, a problem we believe is interesting in
its own right. The approach exploits a generalization of a simple structural result about any $s$-term
DNF $f$: for every function $g : \{-1,1\}^n \to [-1,1]$, the error of $g$ on $f$ (measured as
$\mathbf{E}_U[|f(x)-g(x)|]$) is at most $\gamma \cdot (2s+1)$, where $\gamma$ is the magnitude of the largest difference
between two corresponding Fourier coefficients of $f$ and $g$ (Kalai et al., 2009b). We use $\hat{f}$ to
denote the vector of Fourier coefficients of $f$ and so this difference can be expressed as
$\|\hat{f}-\hat{g}\|_\infty$. Hence to find a function $\epsilon$-close to $f$ it is sufficient to find a function $g$ such
that $\|\hat{f}-\hat{g}\|_\infty \le \epsilon/(2s+1)$, in other words, $g$ that has approximately (in the infinity norm)
the same Fourier spectrum as $f$. We give a new, simple algorithm (Th. 4.1) that constructs a
function (with range in $[-1,1]$) which has approximately the desired Fourier spectrum.

Our algorithm builds $g$ in a fairly straightforward way: starting with the constant function
$g_0 \equiv 0$, we iteratively correct each coefficient to the desired value (by adding the difference in
the coefficients multiplied by the corresponding basis function). After each such step the new
function $g_t$ might have values outside of $[-1,1]$. We correct this by "cutting off" values outside
of $[-1,1]$ (in other words, projecting them to $[-1,1]$). A simple argument shows that both of these
operations reduce $\|f-g_t\|_2^2 = \mathbf{E}_U[(f(x)-g_t(x))^2]$. The coefficient correction procedure reduces
this squared distance measure significantly and implies the convergence of the algorithm. In
addition, through a slightly more complicated potential argument we show that there is no need to
perform the projection after each coefficient update; a single projection after all updates suffices
(Th. 4.3). This implies that the function we construct via this algorithm is itself a polynomial
threshold function (PTF).

To generalize our approach to product distributions, we strengthen the structural lemma about
DNF expressions to measure the error in terms of the largest difference between corresponding
low-degree Fourier coefficients and extend it to product distributions (Th. 3.8). The algorithm itself
uses the Fourier basis for the given product distribution but otherwise remains essentially unchanged.
We also give a more general condition on PTFs that is sufficient for bounding $\mathbf{E}_U[|f(x)-g(x)|]$
in terms of the largest difference between corresponding low-degree Fourier coefficients of $f$ and $g$.
The general condition implies that our algorithm can also be used to learn any integer-weight
linear threshold of terms as long as the sum of the magnitudes of weights (or the total weight) is
polynomial.

We give several applications of our approach. The most immediate one is to obtain a simple
algorithm for learning DNF expressions over product distributions with membership queries
(Cor. 5.1). Given access to membership queries, the Fourier spectrum of any function can be
approximated using the well-known Kushilevitz-Mansour algorithm and its generalization to product
distributions (Goldreich and Levin, 1989; Kushilevitz and Mansour, 1993). We can then apply our
approximation algorithm to get a hypothesis which is $\epsilon$-close to the target function. While
technically our iterative algorithm is similar to boosting, the resulting algorithm for learning DNF is
simpler and more self-contained than previous boosting-based algorithms.

The second application of our approximation algorithm and the motivation for this work is
its use in the context of smoothed analysis of learning DNF over product distributions (Th. 5.4)
where the problem was originally formulated and solved by Kalai et al. (2009b). The approximation
algorithm of Kalai et al. (2009b) is based on an elaborate combination of the positive-reliable DNF
learning algorithm of Kalai et al. (2009a) and the agnostic learning algorithm for decision trees of
Gopalan et al. (2008). In contrast, our algorithm gives a natural solution to the problem which is
significantly simpler technically and is more general. We also note that the algorithm of Kalai et al.
(2009b) does not construct a function with Fourier transform close to that of $f$ and is not based
on the structural results we use.

In another application of our approach we give a new algorithm for learning MDNF expressions
over product distributions. Our algorithm is based on Servedio's algorithm for learning MDNFs
(Servedio, 2004). The main idea of his algorithm is to restrict the target function to influential
variables, those that can change the value of the target function with significant probability. For any
monotone function, influential variables can be easily identified. Then all the Fourier coefficients of
low degree and restricted to influential variables are estimated individually from random examples.
The sign of the resulting low-degree polynomial is used as a hypothesis. The degree for which such
an approximation method is known to work is $20 \cdot \log(s/\epsilon) \cdot \log(1/\epsilon)$ (Mansour, 1995). Using
our simple structural result about DNF and our algorithm for constructing a function with desired
Fourier coefficients, we show (Th. 5.5) that to achieve $\epsilon$-accuracy, coefficients of degree at most
$O(\log(s/\epsilon))$ are sufficient. This results in a $\mathrm{poly}((s \cdot \log(s/\epsilon))^{\log(s/\epsilon)}, n)$-time algorithm,
improving on the $\mathrm{poly}((s \cdot \log(ns/\epsilon))^{\log(s/\epsilon)\cdot\log(1/\epsilon)}, n)$ bound of Servedio (2004).

Related work. A closely related problem of finding a function with specified correlations with
a given set of functions was considered by Trevisan et al. (2009) and their solution is based on a
similar algorithm (with a more involved analysis). Our setting differs in that the set of functions
with which correlations are specified has a superpolynomial size and the functions are not necessarily
boolean (when the distribution is non-uniform).

In the Chow Parameter problem the goal is to find an approximation to a linear threshold
function (LTF) $f$ from its degree-1 and degree-0 Fourier coefficients (the Chow parameters).
O'Donnell and Servedio (2011) gave the first algorithm for the problem which is based on finding
a function whose Chow parameters are close in Euclidean distance to those of $f$ (as opposed
to $\|\cdot\|_\infty$ distance in our problem). Then they used an intricate structural result about LTFs to
derive an approximation bound. Their algorithm is based on a brute-force search of some of the
Chow parameters. A very recent, doubly exponential improvement to the solution of the problem
was obtained using a new, stronger structural result and a new algorithm for constructing a linear
threshold function from approximations of Chow parameters (De et al., 2012). As in our
applications, the algorithm of De et al. (2012) constructs a bounded function with the given degree-1
Fourier spectrum. However, the update step of their algorithm is optimized for minimizing the
Euclidean distance of the Chow parameters of the obtained function to the given ones.

Organization. Structural results required for approximating DNF expressions and PTFs are
given in Section 3. In Section 4 we describe our main algorithm for constructing a function with
the desired Fourier spectrum. In Section 5 we give applications of our approach.



2 Preliminaries 

For an integer $k$, let $[k]$ denote the set $\{1, 2, \ldots, k\}$. For a vector $v \in \mathbb{R}^k$, we use the following
notation for several standard quantities: $\|v\|_0 = |\{i \in [k] \mid v_i \neq 0\}|$, $\|v\|_1 = \sum_{i\in[k]} |v_i|$,
$\|v\|_\infty = \max_{i\in[k]}\{|v_i|\}$ and $\|v\|_2 = \sqrt{\sum_{i\in[k]} v_i^2}$. For a real value $a$, we denote its
projection to $[-1,1]$ by $P_1(a)$. That is, $P_1(a) = a$ if $|a| \le 1$ and $P_1(a) = \mathrm{sign}(a)$ otherwise.
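As a concrete reading of this notation, the following Python sketch computes the four norms and the projection $P_1$. The names `norms` and `project_unit` are illustrative choices made here and do not appear in the paper.

```python
import math

def norms(v):
    """Return (||v||_0, ||v||_1, ||v||_inf, ||v||_2) for a real vector v."""
    l0 = sum(1 for x in v if x != 0)           # number of non-zero entries
    l1 = sum(abs(x) for x in v)                # sum of magnitudes
    linf = max((abs(x) for x in v), default=0)  # largest magnitude
    l2 = math.sqrt(sum(x * x for x in v))      # Euclidean norm
    return l0, l1, linf, l2

def project_unit(a):
    """P_1(a): identity on [-1, 1], sign(a) outside of it (i.e. clipping)."""
    return max(-1.0, min(1.0, a))
```

For example, `norms([3, 0, -4])` gives the tuple of norms of a vector with two non-zero entries, and `project_unit` leaves values such as -0.3 unchanged while mapping 2.5 to 1.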

We refer to real-valued functions with range in $[-1,1]$ as bounded. Let
$B_d = \{a \in \{0,1\}^n \mid \|a\|_0 \le d\}$. For $a \in \{0,1\}^n$ let $\chi_a(x)$ denote the function $\prod_{a_i=1} x_i$.
It is a monomial and also a parity function over variables with indices in $\{i \le n \mid a_i = 1\}$.
A degree-$d$ polynomial threshold function is a function representable as
$\mathrm{sign}(\sum_{a \in B_d} w(a)\chi_a(x))$ for some vector of weights $w \in \mathbb{R}^{B_d}$. When
the representing vector $w$ is sparse we can describe it by listing all the non-zero coefficients only.
We refer to this as being succinctly represented.

PAC learning. Our learning model is Valiant's (1984) well-known PAC model. In this model, for
a concept $f$ and distribution $D$ over $\{-1,1\}^n$, an example oracle $\mathrm{EX}(f,D)$ is an oracle that, upon
request, returns an example $(x, f(x))$ where $x$ is chosen randomly with respect to $D$, independently
of any previous examples. A membership query (MQ) learning algorithm is an algorithm that
has oracle access to the target function $f$ in addition to $\mathrm{EX}(f,D)$, namely it can, for every point
$x \in \{-1,1\}^n$, obtain the value $f(x)$. For $\epsilon > 0$, we say that function $g$ is $\epsilon$-close to function $f$
relative to distribution $D$ if $\mathbf{Pr}_D[f(x) = g(x)] \ge 1 - \epsilon$. For a concept class $C$, we say that an
algorithm $A$ efficiently learns $C$ over distribution $D$, if for every $\epsilon > 0$, $n$, $f \in C$, $A$ outputs, with
probability at least 1/2 and in time polynomial in $n/\epsilon$, a hypothesis $h$ that is $\epsilon$-close to $f$ relative
to $D$. Learning of DNF expressions is commonly parameterized by the size $s$ (i.e. the number of
terms) of the smallest-size DNF representation of $f$. In this case the running time of the efficient
learning algorithm is also allowed to depend polynomially on $s$. For $k \in [n]$ an $s$-term $k$-DNF
expression is a DNF expression with $s$ terms of length at most $k$.

Fourier transform. A number of methods for learning over the uniform distribution $U$ are
based on the Fourier transform technique. The technique relies on the fact that the set of all parity
functions $\{\chi_a(x)\}_{a \in \{0,1\}^n}$ forms an orthonormal basis of the linear space of real-valued functions
over $\{-1,1\}^n$ with inner product defined as $\langle f,g\rangle_U = \mathbf{E}_U[f(x)g(x)]$. This fact implies that any
real-valued function $f$ over $\{-1,1\}^n$ can be uniquely represented as a linear combination of parities,
that is $f(x) = \sum_{a \in \{0,1\}^n} \hat{f}(a)\chi_a(x)$. The coefficient $\hat{f}(a)$ is called the Fourier coefficient of $f$ on $a$
and equals $\mathbf{E}_U[f(x)\chi_a(x)]$; $\|a\|_0$ is called the degree of $\hat{f}(a)$. For a set $S \subseteq \{0,1\}^n$ we use
$\hat{f}(S)$ to denote the vector of all coefficients with indices in $S$ and $\hat{f}$ to denote the vector of all the
Fourier coefficients of $f$. The vector of all degree-$(\le d)$ Fourier coefficients of $f$ can then be
expressed as $\hat{f}(B_d)$. We also use a similar notation for vectors of estimates of Fourier coefficients.
Namely, for $S \subseteq \{0,1\}^n$ we use $\tilde{f}(S)$ to denote a vector in $\mathbb{R}^S$ indexed by vectors in $S$. We denote
by $\tilde{f}(a)$ the $a$-th element of $\tilde{f}(S)$. Whenever appropriate, we use succinct representations for
vectors of Fourier coefficients (i.e. listing only the non-zero coefficients).
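To make the definitions concrete, here is a small brute-force sketch that computes every Fourier coefficient of a function over the uniform distribution by enumerating all $2^n$ points. It is only an illustration for tiny $n$; the helper names `chi` and `fourier_coefficients` and the majority-of-three example are assumptions made here, not part of the paper.

```python
from itertools import product

def chi(a, x):
    """Parity chi_a(x): product of x_i over the coordinates with a_i = 1."""
    out = 1
    for ai, xi in zip(a, x):
        if ai:
            out *= xi
    return out

def fourier_coefficients(f, n):
    """Map every a in {0,1}^n to f_hat(a) = E_U[f(x) * chi_a(x)]."""
    points = list(product((-1, 1), repeat=n))
    return {a: sum(f(x) * chi(a, x) for x in points) / len(points)
            for a in product((0, 1), repeat=n)}

def majority3(x):
    """Majority of three +/-1 values (illustrative target function)."""
    return 1 if sum(x) > 0 else -1

coeffs = fourier_coefficients(majority3, 3)
```

Summing `coeffs[a] * chi(a, x)` over all `a` recovers `majority3(x)` at every point, which is the unique-representation property stated above.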

We will make use of Parseval's identity which states that for every real-valued function $f$
over $\{-1,1\}^n$, $\mathbf{E}_U[f^2] = \sum_a \hat{f}(a)^2 = \|\hat{f}\|_2^2$. Given oracle access to a function $f$ (i.e.
membership queries), the Fourier transform of a function can be approximated using the KM algorithm
(Goldreich and Levin, 1989; Kushilevitz and Mansour, 1993).



Theorem 2.1 (KM algorithm) There exists an algorithm that for any real-valued function
$f : \{-1,1\}^n \to [-1,1]$, given parameters $\theta > 0$, $\delta > 0$ and oracle access to $f$, with probability at least
$1-\delta$, returns a succinctly represented vector $\tilde{f}$, such that $\|\hat{f}-\tilde{f}\|_\infty \le \theta$ and $\|\tilde{f}\|_0 \le 4/\theta^2$. The
algorithm runs in $\tilde{O}(n^2 \cdot \theta^{-6} \cdot \log(1/\delta))$ time and makes $\tilde{O}(n \cdot \theta^{-6} \cdot \log(1/\delta))$ queries to $f$.

Product distributions. We consider learning over product distributions on $\{-1,1\}^n$. For a
vector $\mu \in (-1,1)^n$ let $D_\mu$ denote the product distribution over $\{-1,1\}^n$ such that
$\mathbf{E}_{x \sim D_\mu}[x_i] = \mu_i$ for every $i \in [n]$. For each $i \in [n]$, $x_i = 1$ with probability $(1+\mu_i)/2$. For
$c \in (0,1]$ the distribution $D_\mu$ is said to be $c$-bounded if $\mu \in [-1+c, 1-c]^n$. The uniform distribution
is then equivalent to $D_{\bar{0}}$, where $\bar{0}$ is the all-zero vector, and is 1-bounded. We use $\mathbf{E}_\mu[\cdot]$ to
denote $\mathbf{E}_{x \sim D_\mu}[\cdot]$ and $\mathbf{E}[\cdot]$ to denote $\mathbf{E}_{x \sim U}[\cdot]$ and similarly for $\mathbf{Pr}$.

The Fourier transform technique extends naturally to product distributions (Furst et al., 1991).
For $\mu \in (-1,1)^n$ the inner product is defined as $\langle f,g\rangle_\mu = \mathbf{E}_\mu[f(x)g(x)]$. The corresponding
orthonormal basis of functions over $D_\mu$ is given by the set of functions $\{\phi_{\mu,a} \mid a \in \{0,1\}^n\}$,
where $\phi_{\mu,a}(x) = \prod_{a_i=1} \frac{x_i-\mu_i}{\sqrt{1-\mu_i^2}}$. Every function $f : \{-1,1\}^n \to \mathbb{R}$ can be uniquely
represented as $f(x) = \sum_{a \in \{0,1\}^n} \hat{f}_\mu(a)\phi_{\mu,a}(x)$, where the $\mu$-Fourier coefficient $\hat{f}_\mu(a)$ equals
$\mathbf{E}_\mu[f(x)\phi_{\mu,a}(x)]$. We extend our uniform-distribution notation for vectors of Fourier coefficients
to product distributions analogously. For any product distribution $\mu$, a degree-$d$ polynomial $p(x)$
has no non-zero $\mu$-Fourier coefficients of degree greater than $d$.
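The orthonormality of the product-distribution basis can be checked numerically on a small example. The sketch below enumerates all points of $\{-1,1\}^n$ with their $D_\mu$ probabilities; the helper names (`prob`, `phi`, `inner`, `mu_coefficient`) and the particular vector `mu` are illustrative assumptions.

```python
import math
from itertools import product

def prob(mu, x):
    """Probability of x under D_mu, where x_i = 1 with probability (1 + mu_i)/2."""
    p = 1.0
    for m, xi in zip(mu, x):
        p *= (1 + m * xi) / 2
    return p

def phi(mu, a, x):
    """Basis function phi_{mu,a}(x) = prod_{a_i=1} (x_i - mu_i)/sqrt(1 - mu_i^2)."""
    out = 1.0
    for m, ai, xi in zip(mu, a, x):
        if ai:
            out *= (xi - m) / math.sqrt(1 - m * m)
    return out

def inner(mu, f, g):
    """Inner product <f, g>_mu = E_mu[f(x) g(x)], by exact enumeration."""
    return sum(prob(mu, x) * f(x) * g(x)
               for x in product((-1, 1), repeat=len(mu)))

def mu_coefficient(mu, f, a):
    """mu-Fourier coefficient of f on a."""
    return inner(mu, f, lambda x: phi(mu, a, x))

mu = (0.5, -0.2, 0.0)
indices = list(product((0, 1), repeat=len(mu)))
gram = {(a, b): inner(mu, lambda x: phi(mu, a, x), lambda x: phi(mu, b, x))
        for a in indices for b in indices}
```

Up to floating-point error, `gram` is the identity matrix. For the single-variable function $f(x) = x_1$, the only non-zero coefficients are the degree-0 one (equal to $\mu_1$) and the one on $x_1$ (equal to $\sqrt{1-\mu_1^2}$).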



The KM algorithm has been extended to product distributions by Bellare (1991) (see also
Jackson, 1997). Below we describe a more efficient version given by Kalai et al. (2009b) (referred
to as the EKM algorithm) which is efficient for all product distributions.

Theorem 2.2 (EKM algorithm) There exists an algorithm that for any real-valued function
$f : \{-1,1\}^n \to [-1,1]$, given parameters $\theta > 0$, $\delta > 0$, $\mu \in (-1,1)^n$, and oracle access to $f$, with
probability at least $1-\delta$, returns a succinctly represented vector $\tilde{f}_\mu$, such that
$\|\hat{f}_\mu-\tilde{f}_\mu\|_\infty \le \theta$ and $\|\tilde{f}_\mu\|_0 \le 4/\theta^2$. The algorithm runs in time polynomial in $n$, $1/\theta$ and
$\log(1/\delta)$.

When learning relative to distribution $D_\mu$ we can assume that $\mu$ is known to the learning algorithm.
For our purposes a sufficiently-close approximation to $\mu$ can always be obtained by estimating $\mu_i$
for each $i$ using random samples from $D_\mu$.

Without oracle access to $f$, but given examples of $f$ on points drawn randomly from $D_\mu$ one
can estimate the Fourier coefficients up to degree $d$ by estimating each coefficient individually in
a straightforward way (that is, by using the empirical estimates). A naive way of analyzing the
number of samples required to achieve certain accuracy requires a number of samples that depends
on $\mu$ and the degree of the estimated coefficient (since $|\phi_{\mu,a}(x)|$ depends on them). Kalai et al.
(2009b) gave a more refined analysis which eliminates the dependence on $d$ and $\mu$ and implies the
following theorem.

Theorem 2.3 (Low Degree Algorithm) There exists an algorithm that for any real-valued
function $f : \{-1,1\}^n \to [-1,1]$ and $\mu \in (-1,1)^n$, given parameters $d \in [n]$, $\theta > 0$, $\delta > 0$, and access
to $\mathrm{EX}(f,D_\mu)$, with probability at least $1-\delta$, returns a succinctly-represented vector $\tilde{f}_\mu$, such that
$\|\hat{f}_\mu(B_d)-\tilde{f}_\mu(B_d)\|_\infty \le \theta$ and $\|\tilde{f}_\mu\|_0 \le 4/\theta^2$. The algorithm runs in time
$n^d \cdot \mathrm{poly}(n \cdot \theta^{-1} \cdot \log(1/\delta))$.
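The straightforward empirical estimation mentioned before the theorem can be sketched as follows. This is the naive estimator only: it does not threshold small estimates into a succinct vector and does not reflect the refined sample-complexity analysis behind Theorem 2.3. The function names, the sample size, the seed, and the target $f(x) = x_1$ are all illustrative assumptions.

```python
import math
import random
from itertools import combinations

def sample_point(mu, rng):
    """Draw x from D_mu: x_i = 1 with probability (1 + mu_i)/2."""
    return tuple(1 if rng.random() < (1 + m) / 2 else -1 for m in mu)

def estimate_low_degree(examples, mu, d):
    """Empirical mu-Fourier coefficients for all index sets of size at most d.

    `examples` is a list of pairs (x, f(x)); keys of the result are tuples of
    variable indices (0-based), e.g. () for the degree-0 coefficient.
    """
    n = len(mu)
    estimates = {}
    for k in range(d + 1):
        for idx in combinations(range(n), k):
            total = 0.0
            for x, y in examples:
                w = y
                for i in idx:
                    w *= (x[i] - mu[i]) / math.sqrt(1 - mu[i] ** 2)
                total += w
            estimates[idx] = total / len(examples)
    return estimates

rng = random.Random(0)
mu = (0.5, -0.2, 0.0)
target = lambda x: x[0]                      # hypothetical target function
examples = []
for _ in range(40000):
    x = sample_point(mu, rng)
    examples.append((x, target(x)))
est = estimate_low_degree(examples, mu, d=1)
```

For this target the true coefficients are $\mu_1 = 0.5$ on the empty set and $\sqrt{1-\mu_1^2}$ on the first variable, and zero elsewhere; the estimates should be close to these values for a sample of this size.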

3 Structural Conditions for Approximation 

In this section we prove several connections relating the $L_1$ distance of a low-degree PTF $f$ to
a bounded function $g$ (i.e. $\mathbf{E}[|f(x)-g(x)|]$) and the maximum distance between the low-degree
portions of the Fourier spectrum of $f$ and $g$ (i.e. $\|\hat{f}(B_d)-\hat{g}(B_d)\|_\infty$). A special case of such a
connection was proved by Kalai et al. (2009b). Another special case, for linear threshold functions,
was given by Birkendorf et al. (1998). Our version yields strong bounds for every PTF $f(x) =
\mathrm{sign}(p(x))$ where polynomial $p(x)$ satisfies $|p(x)| \ge 1$ for all $x$ and $p(x)$ is close to a low-degree
polynomial $p'(x)$ of small $\|\cdot\|_1$ norm. In particular, it applies to any function representable as an
integer-weight low-degree PTF of polynomial total weight and to any integer-weight linear threshold
of terms (ANDs) of polynomial total weight (which includes polynomial-size DNF expressions). We
start by defining two simple and known measures of complexity of a degree-$d$ PTF.



Definition 3.1 For $\lambda > 0$, we say that a polynomial $p(x)$ $\lambda$-sign-represents a boolean function
$f(x)$ if for all $x \in \{-1,1\}^n$, $f(x) = \mathrm{sign}(p(x))$ and $|p(x)| \ge \lambda$. For a degree-$d$ PTF $f$, let $W_1^d(f)$
denote

$$\min\{\|\hat{p}\|_1 \mid p \text{ 1-sign-represents } f\}.$$

The degree-$d$ total integer weight of $f$ is

$$TW_1^d(f) = \min\{\|\hat{p}\|_1 \mid \hat{p} \text{ is integer and } f = \mathrm{sign}(p)\}.$$

Remark 3.2 We briefly remark that $W_1^d(f)$ is exactly the inverse of the advantage of a degree-$d$
PTF defined by Krause and Pudlák (1997) as the largest $\lambda$ for which there exists a polynomial $p(x)$
such that $p$ $\lambda$-sign-represents $f$ and $\|\hat{p}\|_1 = 1$. In addition, linear programming duality implies that
the advantage of $f$ equals $\alpha$ if and only if $\alpha$ is the smallest value such that for every distribution $D$
over $\{-1,1\}^n$ there exists a monomial $\chi_a(x)$ of degree at most $d$ such that $|\mathbf{E}_D[f(x) \cdot \chi_a(x)]| \ge \alpha$
(see Nisan's proof in (Impagliazzo, 1995)). Finally, clearly $W_1^d(f) \le TW_1^d(f)$. The characterization
of advantage using the LP duality together with the boosting algorithm by Freund (1995) imply that
$TW_1^d(f) = O(n \cdot W_1^d(f)^2)$.



We first prove a simpler special case of our bound when the representing polynomial $p(x)$ and
the approximating polynomial $p'(x)$ are the same.

Lemma 3.3 Let $p(x)$ be a degree-$d$ polynomial that 1-sign-represents a PTF $f(x)$. For every
$\mu \in (-1,1)^n$ and bounded function $g(x) : \{-1,1\}^n \to [-1,1]$,

$$\mathbf{E}_\mu[|f(x)-g(x)|] \le \|\hat{f}_\mu(B_d)-\hat{g}_\mu(B_d)\|_\infty \cdot \|\hat{p}_\mu(B_d)\|_1.$$

Proof: First note that for every $x$, the values $f(x)$, $f(x)-g(x)$ and $p(x)$ have the same sign.
Therefore $\mathbf{E}_\mu[|f(x)-g(x)|] = \mathbf{E}_\mu[f(x)(f(x)-g(x))] \le \mathbf{E}_\mu[p(x)(f(x)-g(x))]$. From here we
immediately get that

$$\mathbf{E}_\mu[p(x)(f(x)-g(x))] = \sum_{a \in B_d} \hat{p}_\mu(a)\,\mathbf{E}_\mu[(f(x)-g(x))\phi_{\mu,a}(x)] = \sum_{a \in B_d} \hat{p}_\mu(a)(\hat{f}_\mu(a)-\hat{g}_\mu(a))$$

$$\le \|\hat{f}_\mu(B_d)-\hat{g}_\mu(B_d)\|_\infty \cdot \|\hat{p}_\mu(B_d)\|_1.$$

$\Box$
To apply our bound to functions which are close (but not equal) to a degree-$d$ PTF we also give
the following approximate version of Lemma 3.3.






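Lemma 3.4 Let $p(x)$ be a polynomial that 1-sign-represents a PTF $f(x)$ and let $p'(x)$ be any
degree-$d$ polynomial. For every $\mu \in (-1,1)^n$ and a bounded function $g(x) : \{-1,1\}^n \to [-1,1]$,

$$\mathbf{E}_\mu[|f(x)-g(x)|] \le \|\hat{f}_\mu(B_d)-\hat{g}_\mu(B_d)\|_\infty \cdot \|\hat{p}'_\mu(B_d)\|_1 + 2 \cdot \mathbf{E}_\mu[|p'(x)-p(x)|].$$

Proof: Following the proof of Lemma 3.3 we get

$$\mathbf{E}_\mu[|f(x)-g(x)|] \le \mathbf{E}_\mu[p(x)(f(x)-g(x))]$$

$$= \mathbf{E}_\mu[p'(x)(f(x)-g(x))] + \mathbf{E}_\mu[(p(x)-p'(x))(f(x)-g(x))]$$

$$\le \|\hat{f}_\mu(B_d)-\hat{g}_\mu(B_d)\|_\infty \cdot \|\hat{p}'_\mu(B_d)\|_1 + \mathbf{E}_\mu[2|p'(x)-p(x)|],$$

where the last step uses $|f(x)-g(x)| \le 2$.

$\Box$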
We now give bounds on such representations of DNF expressions. As a warm-up we start with
the uniform distribution case which is implicit in (Kalai et al., 2009b).

Lemma 3.5 For any $s$-term DNF $f$, $W_1^n(f) \le 2s+1$.

Proof: Let $t_1(x), t_2(x), \ldots, t_s(x)$ denote the $\{0,1\}$ versions of each of the terms of $f$. For each
$i \in [s]$ let $T_i$ denote the set of the indices of all the variables in the term $t_i$. Then
$t_i = \prod_{j \in T_i} \frac{1 \pm x_j}{2}$, where the sign of each variable $x_j$ is determined by whether it is negated
or not in $t_i$. As is well-known (e.g. Blum et al., 1994), this implies that $\|\hat{t}_i\|_1 = 1$. Now, let
$p(x) = 2\sum_{i \in [s]} t_i(x) - 1$. It is easy to see that $|p(x)| \ge 1$, $f(x) = \mathrm{sign}(p(x))$ and

$$\|\hat{p}\|_1 \le 2\sum_{i \in [s]} \|\hat{t}_i\|_1 + 1 \le 2s+1.$$

$\Box$
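The construction in this proof is easy to verify by brute force on a small DNF. In the sketch below a term is represented as a dictionary from variable index to the required value (+1 for a positive literal, -1 for a negated one); this encoding, the helper names, and the particular 3-term DNF over 4 variables are assumptions made for illustration.

```python
from itertools import product

def term_value(term, x):
    """{0,1} value of an AND of literals; term maps index -> required +/-1 value."""
    return 1 if all(x[i] == v for i, v in term.items()) else 0

def l1_spectral_norm(p, n):
    """Sum of |p_hat(a)| over all a in {0,1}^n under the uniform distribution."""
    points = list(product((-1, 1), repeat=n))
    total = 0.0
    for a in product((0, 1), repeat=n):
        coeff = 0.0
        for x in points:
            parity = 1
            for ai, xi in zip(a, x):
                if ai:
                    parity *= xi
            coeff += p(x) * parity
        total += abs(coeff / len(points))
    return total

n = 4
terms = [{0: 1, 1: 1}, {1: -1, 2: 1}, {0: -1, 3: 1}]     # hypothetical 3-term DNF
f = lambda x: 1 if any(term_value(t, x) for t in terms) else -1
p = lambda x: 2 * sum(term_value(t, x) for t in terms) - 1  # p = 2 * sum_i t_i - 1
spectral_l1 = l1_spectral_norm(p, n)
```

Each term has spectral norm exactly 1, `p` agrees in sign with `f` and has magnitude at least 1 everywhere, and `spectral_l1` does not exceed $2s+1 = 7$.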



An immediate corollary of Lemma 3.3 and Lemma 3.5 is the following bound given by Kalai et al.
(2009b).

Corollary 3.6 Let $f$ be an $s$-term DNF expression. For every bounded function $g(x)$,
$\mathbf{E}[|f(x)-g(x)|] \le (2s+1) \cdot \|\hat{f}-\hat{g}\|_\infty$.

As can be seen from the proof of Lemma 3.5, bounding $W_1^n(f)$ is based on bounding $\|\hat{t}_i\|_1$ for every
term $t_i$ of a DNF expression. Therefore we next prove a product distribution bound on $\|\hat{t}_i\|_1$.

Lemma 3.7 Let $t(x)$ be a $\{0,1\}$ AND of $d$ boolean literals, that is, for a set of $d$ literals
$T \subseteq \{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$, $t(x) = 1$ when all literals in $T$ are set to 1 in $x$ and 0
otherwise. For any constant $c \in (0,1]$ and $\mu \in [-1+c, 1-c]^n$,

$$\|\hat{t}_\mu\|_1 = \|\hat{t}_\mu(B_d)\|_1 \le (2-c)^{d/2}.$$

Proof: Let $S$ denote the set of all vectors in $\{0,1\}^n$ corresponding to subsets of $T$, that is

$$S = \{a \mid \forall i \in [n],\ (a_i = 1 \Rightarrow \{x_i, \bar{x}_i\} \cap T \neq \emptyset)\}.$$

Clearly, $\|\hat{t}_\mu\|_1 = \|\hat{t}_\mu(B_d)\|_1 = \|\hat{t}_\mu(S)\|_1$. In addition, by Parseval's identity

$$\|\hat{t}_\mu\|_2^2 = \mathbf{E}_\mu[t(x)^2] = \mathbf{Pr}_\mu[t(x) = 1] \le (1-c/2)^d.$$

Now, since $|S| = 2^d$, by the Cauchy-Schwarz inequality,

$$\|\hat{t}_\mu(S)\|_1 \le 2^{d/2} \cdot \|\hat{t}_\mu\|_2 \le 2^{d/2} \cdot (1-c/2)^{d/2} = (2-c)^{d/2},$$

giving us the desired bound. $\Box$
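A quick numerical sanity check of this bound on one small instance is sketched below. The term, the vector `mu`, and the value of `c` are hypothetical; the helper computes every $\mu$-Fourier coefficient of the term by exact enumeration and sums their magnitudes.

```python
import math
from itertools import product

def mu_spectral_l1(t, mu):
    """Sum over a of |t_hat_mu(a)|, computed by exact enumeration of {-1,1}^n."""
    n = len(mu)
    points = list(product((-1, 1), repeat=n))
    total = 0.0
    for a in product((0, 1), repeat=n):
        coeff = 0.0
        for x in points:
            weight, basis = 1.0, 1.0
            for m, ai, xi in zip(mu, a, x):
                weight *= (1 + m * xi) / 2                    # Pr under D_mu
                if ai:
                    basis *= (xi - m) / math.sqrt(1 - m * m)  # phi_{mu,a}(x)
            coeff += weight * t(x) * basis
        total += abs(coeff)
    return total

c = 0.4
mu = (0.6, -0.3, 0.1)                  # c-bounded: every |mu_i| <= 1 - c
literals = {0: 1, 1: -1, 2: 1}         # the term x_1 AND (NOT x_2) AND x_3
t = lambda x: 1 if all(x[i] == v for i, v in literals.items()) else 0
d = len(literals)
lhs = mu_spectral_l1(t, mu)
rhs = (2 - c) ** (d / 2)
```

On this instance `lhs` is roughly 1.417 while `rhs` is roughly 2.024, consistent with the lemma; the gap reflects the slack in the Cauchy-Schwarz step.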

We now use Lemmas 3.4 and 3.7 to give a bound for all product distributions.

Theorem 3.8 Let $c \in (0,1]$ be a constant, $\mu$ be a $c$-bounded distribution and $\epsilon > 0$. For an integer
$s > 0$ let $f$ be an $s$-term DNF. For $d = \lfloor \log(s/\epsilon)/\log(2/(2-c)) \rfloor$ and every bounded function
$g : \{-1,1\}^n \to [-1,1]$,

$$\mathbf{E}_\mu[|f(x)-g(x)|] \le (2 \cdot (2-c)^{d/2} \cdot s + 1) \cdot \|\hat{f}_\mu(B_d)-\hat{g}_\mu(B_d)\|_\infty + 4\epsilon.$$

Proof: As in the proof of Lemma 3.5, let $t_1(x), t_2(x), \ldots, t_s(x)$ denote the $\{0,1\}$ versions of each
of the terms of $f$ and let $p(x) = 2\sum_{i \in [s]} t_i(x) - 1$ be a polynomial that 1-sign-represents $f$. Now let
$M \subseteq [s]$ denote the set of indices of $f$'s terms which have length $\ge d+1 \ge \log(s/\epsilon)/\log(2/(2-c))$
and let $p'(x) = 2\sum_{i \notin M} t_i(x) - 1$. In other words, $p'$ is $p$ with contributions of long terms removed
and, in particular, is a degree-$d$ polynomial.

For each $i \in M$, $\mathbf{E}_\mu[t_i(x)] = \mathbf{Pr}_\mu[t_i(x) = 1] \le (1-c/2)^{d+1} \le \epsilon/s$. This implies that

$$\mathbf{E}_\mu[|p'(x)-p(x)|] \le \sum_{i \in M} \mathbf{E}_\mu[2|t_i(x)|] \le 2\epsilon. \qquad (1)$$

Using Lemma 3.7 we get

$$\|\hat{p}'_\mu(B_d)\|_1 \le 2\sum_{i \notin M} \|\hat{t}_{i,\mu}(B_d)\|_1 + 1 \le 2 \cdot (2-c)^{d/2} \cdot s + 1. \qquad (2)$$

We can now apply Lemma 3.4 and equations (1, 2) to obtain

$$\mathbf{E}_\mu[|f(x)-g(x)|] \le \|\hat{f}_\mu(B_d)-\hat{g}_\mu(B_d)\|_\infty \cdot \|\hat{p}'_\mu(B_d)\|_1 + 2\mathbf{E}_\mu[|p'(x)-p(x)|]$$

$$\le (2 \cdot (2-c)^{d/2} \cdot s + 1) \cdot \|\hat{f}_\mu(B_d)-\hat{g}_\mu(B_d)\|_\infty + 4\epsilon.$$

$\Box$
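To get a feel for the parameters in Theorem 3.8, the short sketch below computes the degree $d$ and the multiplicative factor $2 \cdot (2-c)^{d/2} \cdot s + 1$ for given $s$, $\epsilon$ and $c$. The function name and the sample values are illustrative assumptions; the logarithm base is irrelevant because only a ratio of logarithms appears.

```python
import math

def degree_and_factor(s, eps, c):
    """Return (d, factor) with d = floor(log(s/eps) / log(2/(2-c))) and
    factor = 2 * (2-c)^(d/2) * s + 1, as in the statement of Theorem 3.8."""
    d = math.floor(math.log(s / eps) / math.log(2 / (2 - c)))
    factor = 2 * (2 - c) ** (d / 2) * s + 1
    return d, factor

uniform_case = degree_and_factor(s=10, eps=0.1, c=1.0)   # uniform distribution
skewed_case = degree_and_factor(s=10, eps=0.1, c=0.5)    # 0.5-bounded product distribution
```

In the uniform case ($c = 1$) the factor reduces to $2s+1$, matching Corollary 3.6, while for smaller $c$ both the degree and the factor grow. In both cases terms longer than $d$ are satisfied with probability at most $\epsilon/s$, which is the property the proof uses.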
It is easy to see that Theorem 3.8 generalizes to any function that can be expressed as a low-weight
linear threshold of terms. Specifically, we prove the following generalization (the proof appears in
Appendix A).

Theorem 3.9 Let $c \in (0,1]$ be a constant, $\mu$ be a $c$-bounded distribution and $\epsilon > 0$. For an
integer $s > 0$ let $f = h(u_1, u_2, \ldots, u_s)$, where $h$ is an LTF over $\{-1,1\}^s$ and $u_i$'s are terms. For
$d = \lfloor \log(W_1^1(h)/\epsilon)/\log(2/(2-c)) \rfloor$ and every bounded function $g : \{-1,1\}^n \to [-1,1]$,

$$\mathbf{E}_\mu[|f(x)-g(x)|] \le (2 \cdot (2-c)^{d/2} + 1) \cdot W_1^1(h) \cdot \|\hat{f}_\mu(B_d)-\hat{g}_\mu(B_d)\|_\infty + 4\epsilon.$$

For $c = 1$, $(2-c)^{d/2} = 1$ and for $c \in (0,1)$, $(2-c)^{d/2} \le (W_1^1(h)/\epsilon)^{(1/\log(2/(2-c))-1)/2}$.



4 Construction of a Fourier Spectrum Approximating Function 

As follows from Corollary 3.6 (and Th. 3.8), to $\epsilon$-approximate a DNF expression over a product
distribution, it is sufficient to find a bounded function $g$ such that $g$ has approximately the same
Fourier spectrum as $f$. In this section we show how this can be done by giving an algorithm which
constructs a function with the desired Fourier spectrum or the low-degree part thereof.

Our algorithm is based on the following idea: given a bounded function $g$ such that for some
$a$, $|\hat{f}(a)-\hat{g}(a)| \ge \gamma$ we show how to obtain a bounded function $g_1$ which is closer in $L_2$ distance
squared to $f$ than $g$. Parseval's identity states that $\mathbf{E}[(f-g)^2] = \sum_b (\hat{f}(b)-\hat{g}(b))^2$. Therefore to
improve the distance to $f$ we do the simplest imaginable update: define $g' = g + (\hat{f}(a)-\hat{g}(a))\chi_a$.
In other words $g'$ is the same as $g$ but with $a$'s Fourier coefficient set to $\hat{f}(a)$. Clearly,

$$\mathbf{E}[(f-g')^2] = \sum_b (\hat{f}(b)-\hat{g}'(b))^2 = \mathbf{E}[(f-g)^2] - (\hat{f}(a)-\hat{g}(a))^2 \le \mathbf{E}[(f-g)^2] - \gamma^2.$$

The only problem with this approach is that $g'$ is not necessarily a function with values bounded
in $[-1,1]$. However, following the idea from (Feldman, 2009), we can convert $g'$ to a bounded
function $g_1$ by cutting off all values outside of $[-1,1]$ (which is achieved by applying the projection
function $P_1$). The target function $f$ is boolean and therefore this step can only decrease the $L_2$
distance squared to $f$. This simple argument implies that starting with $g_0 \equiv 0$ we can update it
iteratively until we reach a bounded function $g_t$ such that for all $a$, $|\hat{f}(a)-\hat{g}_t(a)| < \gamma$. The decrease
in the $L_2$ distance squared at every step implies that the total number of steps cannot exceed $1/\gamma^2$.
Also note that for running this algorithm the only thing we need are (the approximate values of)
the Fourier coefficients of $f$.

We now state and prove the claim formally. The input to our algorithm is a vector $\tilde{f}(B_d) \in \mathbb{R}^{B_d}$ 
of desired coefficients up to degree $d$ given to some accuracy $\gamma$. Further, in our applications we 
will only use vectors with at most $O(1/\gamma^2)$ non-zero coefficients since for every Boolean function 
at most $1/\gamma^2$ of its Fourier coefficients are of magnitude greater than $\gamma$ and smaller coefficients are 
approximated by 0. 

Theorem 4.1 There exists a randomized algorithm PTFapprox that for every boolean function 
$f : \{-1, 1\}^n \to \{-1, 1\}$, given $\gamma > 0$, $\delta > 0$, a degree bound $d$ and a succinctly-represented vector of 
coefficients $\tilde{f}(B_d) \in \mathbb{R}^{B_d}$ such that $\|\hat{f}(B_d) - \tilde{f}(B_d)\|_\infty \leq \gamma$ and $\|\tilde{f}(B_d)\|_0 = O(1/\gamma^2)$, with probability 
at least $1 - \delta$, outputs a bounded function $g : \{-1, 1\}^n \to [-1, 1]$ such that $\|\hat{f}(B_d) - \hat{g}(B_d)\|_\infty \leq 5\gamma$. 
The algorithm runs in time polynomial in $n$, $1/\gamma$ and $\log(1/\delta)$. 

Proof: We build $g$ via the following iterative process. Let $g_0 \equiv 0$. At step $t$, given $g_t$, we run the 
KM algorithm (Th. 2.1) to compute all the Fourier coefficients of $g_t$ which are of degree at most 
$d$ to accuracy $\gamma/2$. Let $\tilde{g}_t(B_d) \in \mathbb{R}^{B_d}$ denote the vector of estimates output by the algorithm. By 
Theorem 2.1 there are at most $16/\gamma^2$ non-zero coefficients in $\tilde{g}_t(B_d)$. For now let us assume that the 
output of the KM algorithm is always correct; we will deal with the confidence bounds later in the standard 
manner. 

If $\|\tilde{g}_t(B_d) - \tilde{f}(B_d)\|_\infty \leq 7\gamma/2$, then we stop and output $g_t$. By the triangle inequality, 

$$\|\hat{f}(B_d) - \hat{g}_t(B_d)\|_\infty \leq \|\hat{f}(B_d) - \tilde{f}(B_d)\|_\infty + \|\tilde{f}(B_d) - \tilde{g}_t(B_d)\|_\infty + \|\tilde{g}_t(B_d) - \hat{g}_t(B_d)\|_\infty \leq \gamma + 7\gamma/2 + \gamma/2 = 5\gamma,$$

in other words $g_t$ satisfies the claimed condition. 

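Otherwise, there exists $a \in B_d$ such that $|\tilde{g}_t(a) - \tilde{f}(a)| > 7\gamma/2$. We note that using the succinct 
representation of $\tilde{f}(B_d)$ and $\tilde{g}_t(B_d)$ such $a$ can be found in $O(n(\|\tilde{g}_t\|_0 + \|\tilde{f}\|_0)) = O(n/\gamma^2)$ time. 
First observe that, by the triangle inequality, 

$$|\hat{g}_t(a) - \hat{f}(a)| \geq |\tilde{g}_t(a) - \tilde{f}(a)| - |\hat{f}(a) - \tilde{f}(a)| - |\hat{g}_t(a) - \tilde{g}_t(a)| \geq 7\gamma/2 - \gamma - \gamma/2 = 2\gamma.$$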

Let $g'_{t+1} = g_t + (\tilde{f}(a) - \tilde{g}_t(a))\chi_a$. The Fourier spectra of $g_t$ and $g'_{t+1}$ differ only on $a$. Therefore, 
by using Parseval's identity, we obtain that 

$$\mathbf{E}[(f - g_t)^2] - \mathbf{E}[(f - g'_{t+1})^2] = (\hat{f}(a) - \hat{g}_t(a))^2 - (\hat{f}(a) - \tilde{f}(a) + \tilde{g}_t(a) - \hat{g}_t(a))^2 \geq (2\gamma)^2 - (3\gamma/2)^2 = 7\gamma^2/4. \quad (3)$$

Now let $g_{t+1} = P_1(g'_{t+1})$. For every $x$, $(f(x) - g_{t+1}(x))^2 \leq (f(x) - g'_{t+1}(x))^2$. Together with equation 
(3) this implies that $\mathbf{E}[(f - g_{t+1})^2] \leq \mathbf{E}[(f - g_t)^2] - 7\gamma^2/4$. At step 0 we have $\mathbf{E}[(f - g_0)^2] = 1$ and 
therefore the process will terminate after at most $4/(7\gamma^2)$ steps. 

We note that in order to make sure that the success probability is at least $1 - \delta$ it is sufficient to 
run the KM algorithm with confidence parameter $7\delta\gamma^2/4$ (a union bound over the at most $4/(7\gamma^2)$ steps). At step $t$ evaluating $g_t$ on any point $x$ 
takes $O(t \cdot n)$ time and therefore each invocation of the KM algorithm takes $\tilde{O}(n^2 \cdot \gamma^{-8} \cdot \log(1/\delta))$ 
time. Overall this implies that the running time of PTFapprox is $\tilde{O}(n^2 \cdot \gamma^{-10} \cdot \log(1/\delta))$. □ 
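
The iteration in this proof can be tried out on a toy scale. The following Python sketch is an illustration under stated assumptions, not the algorithm as analyzed above: it enumerates all $2^n$ points and computes the Fourier coefficients of $g_t$ exactly by brute force instead of calling the KM algorithm, so it is only feasible for small $n$ and has no confidence parameter. The names `parity_table` and `ptf_approx` are hypothetical.

```python
import itertools
import numpy as np

def parity_table(n):
    """Points of {-1,1}^n, index vectors a in {0,1}^n, and rows chi_a(x) over all points."""
    points = np.array(list(itertools.product([-1, 1], repeat=n)))
    subsets = np.array(list(itertools.product([0, 1], repeat=n)))
    # chi_a(x) is the product of x_i over the coordinates i with a_i = 1.
    chars = np.array([np.prod(np.where(a == 1, points, 1), axis=1) for a in subsets])
    return points, subsets, chars

def ptf_approx(f_tilde, chars, low_degree, gamma):
    """Repeat 'correct one coefficient, then clip to [-1,1]' until the spectra are close.

    Assumption: coefficients of g_t are computed exactly here (no KM algorithm).
    """
    num_points = chars.shape[1]
    g = np.zeros(num_points)                       # g_0 = 0
    steps = 0
    while True:
        g_hat = chars @ g / num_points             # Fourier coefficients of g_t
        gap = (f_tilde - g_hat) * low_degree       # only coefficients in B_d are compared
        a = int(np.argmax(np.abs(gap)))
        if abs(gap[a]) <= 3.5 * gamma:             # stopping rule 7*gamma/2 from the proof
            return g, steps
        # g'_{t+1} = g_t + gap * chi_a, followed by the projection P_1.
        g = np.clip(g + gap[a] * chars[a], -1.0, 1.0)
        steps += 1
```

For example, one can take a small DNF such as $(x_1 \wedge x_2) \vee (x_3 \wedge x_4)$, pass its exact degree-2 coefficients as `f_tilde`, and check that the number of steps stays below $4/(7\gamma^2)$.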

A simple observation about PTFapprox is that it does not rely on the update step being a 
multiple of a boolean function. Therefore it would work verbatim for any orthonormal basis and 
not only parities. In particular, by using the EKM algorithm in place of KM we can easily extend our 
algorithm to any product distribution. 

Theorem 4.2 There exists a randomized algorithm PTFapproxProd that for every $\mu \in (-1, 1)^n$, 
boolean function $f : \{-1, 1\}^n \to \{-1, 1\}$, given $\mu$, $\gamma > 0$, $\delta > 0$, a degree bound $d$ and a succinctly-represented 
vector of coefficients $\tilde{f}_\mu(B_d) \in \mathbb{R}^{B_d}$ such that $\|\hat{f}_\mu(B_d) - \tilde{f}_\mu(B_d)\|_\infty \leq \gamma$ and $\|\tilde{f}_\mu(B_d)\|_0 = 
O(1/\gamma^2)$, with probability at least $1 - \delta$, outputs a function $g : \{-1, 1\}^n \to [-1, 1]$ such that $\|\hat{f}_\mu(B_d) - 
\hat{g}_\mu(B_d)\|_\infty \leq 5\gamma$. The algorithm runs in time polynomial in $n$, $1/\gamma$ and $\log(1/\delta)$. 

4.1 A Proper Construction Algorithm 

One disadvantage of this construction is that $g$ output by PTFapprox is not a PTF itself. The 
reason for this is that the projection operation $P_1$ is applied after every update. We now show that 
instead of applying the projection step after every update it is sufficient to apply the projection once 
to all the updates. This idea is based on Impagliazzo's (1995) argument in the context of hardcore 
set construction, and is also the basis for the algorithm of Trevisan et al. (2009). Impagliazzo's 
proof uses the same $L_2$ squared potential function but requires an additional point-wise counting 
argument to prove that the potential can be used to bound the number of steps. Instead, we 
augment the potential function in a way that captures the additional counting argument and 
generalizes to non-boolean functions (necessary for the product distribution case). As a result the 
algorithm will output a function of the form $P_1(\sum_{a \in B_d} \alpha_a \chi_a)$ which is then converted to a PTF 
by applying the sign function. The same idea is also used in the Chow parameter reconstruction 
algorithm of De et al. (2012). The modified proof also allows us to easily derive a bound on the 
total integer weight of the resulting PTF and optimize the running time of the algorithm (the 
optimization of running time is deferred to a full version of this work). 

Theorem 4.3 There exists a randomized algorithm PTFconstructProd that for every $\mu \in (-1, 1)^n$, 
boolean function $f : \{-1, 1\}^n \to \{-1, 1\}$, given $\mu$, $\gamma > 0$, $\delta > 0$, a degree bound $d$ and a succinctly-represented 
vector of coefficients $\tilde{f}_\mu(B_d) \in \mathbb{R}^{B_d}$ such that $\|\hat{f}_\mu(B_d) - \tilde{f}_\mu(B_d)\|_\infty \leq \gamma$ and $\|\tilde{f}_\mu(B_d)\|_0 = 
O(1/\gamma^2)$, with probability at least $1 - \delta$, outputs a bounded function $g : \{-1, 1\}^n \to [-1, 1]$ such 
that $\|\hat{f}_\mu(B_d) - \hat{g}_\mu(B_d)\|_\infty \leq 5\gamma$. The algorithm runs in time polynomial in $n$, $1/\gamma$ and $\log(1/\delta)$. In 
addition, $g(x) = P_1(g'(x))$ for a degree-$d$ polynomial $g'$ such that $\hat{g'}_\mu = \gamma \cdot \beta$ where $\beta$ is a vector of 
integers and $\|\beta\|_1 \leq 1/(2\gamma^2)$. 

Proof: As in the proof of Theorem 4.1 we build $g$ via an iterative process starting from $g'_0 \equiv 0$ and 
$g_0 = P_1(g'_0)$. We use the EKM algorithm (Th. 2.2) to compute $\tilde{g}_{t,\mu}(B_d)$ and stop and return $g_t$ if 
$\|\tilde{g}_{t,\mu}(B_d) - \tilde{f}_\mu(B_d)\|_\infty \leq 7\gamma/2$. Otherwise (there exists $a \in B_d$ such that $|\tilde{g}_{t,\mu}(a) - \tilde{f}_\mu(a)| > 7\gamma/2$ and 
$|\hat{g}_{t,\mu}(a) - \hat{f}_\mu(a)| \geq 2\gamma$), we let $\gamma' = \gamma \cdot \mathrm{sign}(\tilde{f}_\mu(a) - \tilde{g}_{t,\mu}(a))$, $g'_{t+1} = g'_t + \gamma' \chi_{a,\mu}$ and $g_{t+1} = P_1(g'_{t+1})$. 
We prove a bound on the total number of steps using the following potential function: 

$$E(t) = \mathbf{E}_\mu[(f - g_t)^2] + 2\mathbf{E}_\mu[(f - g_t)(g_t - g'_t)] = \mathbf{E}_\mu[(f - g_t)(f - 2g'_t + g_t)].$$

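The key claim of this proof is that $E(t) - E(t + 1) \geq 2\gamma^2$. First, 

$$E(t) - E(t + 1) = \mathbf{E}_\mu[(f - g_t)(f - 2g'_t + g_t)] - \mathbf{E}_\mu[(f - g_{t+1})(f - 2g'_{t+1} + g_{t+1})]$$
$$= \mathbf{E}_\mu[(f - g_t)(2g'_{t+1} - 2g'_t) - (g_{t+1} - g_t)(2g'_{t+1} - g_t - g_{t+1})]$$
$$= \mathbf{E}_\mu[2(f - g_t)\gamma'\chi_{a,\mu}] - \mathbf{E}_\mu[(g_{t+1} - g_t)(2g'_{t+1} - g_t - g_{t+1})]. \quad (4)$$

We observe that $\mathbf{E}_\mu[2(f - g_t)\gamma'\chi_{a,\mu}] = 2\gamma'(\hat{f}_\mu(a) - \hat{g}_{t,\mu}(a))$ and that $\mathrm{sign}(\hat{f}_\mu(a) - \hat{g}_{t,\mu}(a)) = \mathrm{sign}(\tilde{f}_\mu(a) - 
\tilde{g}_{t,\mu}(a))$. Therefore, we get 

$$\mathbf{E}_\mu[2(f - g_t)\gamma'\chi_{a,\mu}] = 2\gamma|\hat{g}_{t,\mu}(a) - \hat{f}_\mu(a)| \geq 4\gamma^2. \quad (5)$$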

To upper-bound the expression $\mathbf{E}_\mu[(g_{t+1} - g_t)(2g'_{t+1} - g_t - g_{t+1})]$ we prove that for every point 
$x \in \{-1, 1\}^n$, 

$$(g_{t+1}(x) - g_t(x))(2g'_{t+1}(x) - g_t(x) - g_{t+1}(x)) \leq 2\gamma^2\chi_{a,\mu}(x)^2.$$

We first observe that $|g_{t+1}(x) - g_t(x)| = |P_1(g'_t(x) + \gamma'\chi_{a,\mu}(x)) - P_1(g'_t(x))| \leq |\gamma'\chi_{a,\mu}(x)| = |\gamma\chi_{a,\mu}(x)|$ 
(a projection operation does not increase the distance). Now 

$$|2g'_{t+1}(x) - g_t(x) - g_{t+1}(x)| \leq |g'_{t+1}(x) - g_t(x)| + |g'_{t+1}(x) - g_{t+1}(x)|.$$

The first part $|g'_{t+1}(x) - g_t(x)| = |\gamma'\chi_{a,\mu}(x) + g'_t(x) - g_t(x)| \leq |\gamma'\chi_{a,\mu}(x)|$ unless $g'_t(x) - g_t(x) \neq 0$ 
and $g'_t(x) - g_t(x)$ has the same sign as $\gamma'\chi_{a,\mu}(x)$. However, in this case $g_{t+1}(x) = g_t(x)$ and as a 
result $(g_{t+1}(x) - g_t(x))(2g'_{t+1}(x) - g_t(x) - g_{t+1}(x)) = 0$. Similarly, $|g'_{t+1}(x) - g_{t+1}(x)| \leq |\gamma'\chi_{a,\mu}(x)|$ 
unless $g_{t+1}(x) = g_t(x)$. Altogether we obtain that 

$$(g_{t+1}(x) - g_t(x))(2g'_{t+1}(x) - g_t(x) - g_{t+1}(x)) \leq \max\{0, |\gamma'\chi_{a,\mu}(x)|(|\gamma'\chi_{a,\mu}(x)| + |\gamma'\chi_{a,\mu}(x)|)\} = 2\gamma^2\chi_{a,\mu}(x)^2.$$

This implies that 

$$\mathbf{E}_\mu[(g_{t+1} - g_t)(2g'_{t+1} - g_t - g_{t+1})] \leq 2\gamma^2\mathbf{E}_\mu[\chi_{a,\mu}(x)^2] = 2\gamma^2. \quad (6)$$

By substituting equations (5) and (6) into equation (4), we obtain the claimed decrease in the 
potential function 

$$E(t) - E(t + 1) \geq 4\gamma^2 - 2\gamma^2 = 2\gamma^2.$$

We now observe that $E(t) = \mathbf{E}_\mu[(f - g_t)^2] + 2\mathbf{E}_\mu[(f - g_t)(g_t - g'_t)] \geq 0$ for all $t$. This follows 
from noting that for every $x$ and $f(x) \in \{-1, 1\}$, either $f(x) - P_1(g'_t(x))$ and $P_1(g'_t(x)) - g'_t(x)$ 
have the same sign or one of them equals zero. Therefore $\mathbf{E}_\mu[(f - g_t)(g_t - g'_t)] \geq 0$ (and, naturally, 
$\mathbf{E}_\mu[(f - g_t)^2] \geq 0$). It is easy to see that $E(0) = 1$ and therefore this process will stop after at most 
$1/(2\gamma^2)$ steps. 

The claim on the representation of $g_t$ output by the algorithm follows immediately from the 
definition of $g_t = P_1(g'_t)$ and $g'_t$ being a sum of $t$ $\mu$-Fourier basis functions multiplied by $\pm\gamma$. □ 
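
The difference from PTFapprox is that the updates are accumulated in an unprojected polynomial and the projection is applied once. The following Python sketch illustrates this for the uniform distribution on a small cube. It is an illustration under stated assumptions rather than PTFconstructProd itself: Fourier coefficients are computed exactly by enumeration instead of by the EKM algorithm, and the names `characters` and `ptf_construct` are hypothetical.

```python
import itertools
import numpy as np

def characters(n):
    """Points of {-1,1}^n, index vectors a in {0,1}^n, and rows chi_a(x) (uniform case)."""
    points = np.array(list(itertools.product([-1, 1], repeat=n)))
    subsets = np.array(list(itertools.product([0, 1], repeat=n)))
    chars = np.array([np.prod(np.where(a == 1, points, 1), axis=1) for a in subsets])
    return points, subsets, chars

def ptf_construct(f_tilde, chars, low_degree, gamma):
    """Accumulate +-gamma updates in g' and project only when forming g = P_1(g').

    Assumption: coefficients of g_t are exact (brute force), so no confidence parameter.
    """
    num_points = chars.shape[1]
    beta = np.zeros(chars.shape[0], dtype=int)           # g' = gamma * sum_a beta[a] * chi_a
    while True:
        g = np.clip(gamma * (beta @ chars), -1.0, 1.0)   # g_t = P_1(g'_t)
        gap = (f_tilde - chars @ g / num_points) * low_degree
        a = int(np.argmax(np.abs(gap)))
        if abs(gap[a]) <= 3.5 * gamma:                   # stopping rule 7*gamma/2
            return g, beta
        beta[a] += 1 if gap[a] > 0 else -1               # g'_{t+1} = g'_t + gamma' * chi_a
```

The returned integer vector `beta` plays the role of the weight vector in Theorem 4.3: its $\ell_1$ norm equals the number of update steps, and `np.sign(gamma * (beta @ chars))` is the corresponding PTF hypothesis.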

5 Applications to Learning DNF Expressions 

We now give several applications of our approximating algorithms to the problem of learning DNF 
expressions in several models of learning. Our first application is a new algorithm for learning DNF 
expressions using membership queries over any product distribution. In the second application 
we show a simple algorithm for learning DNF expressions from random examples coming from a 
smoothed product distribution. In the third application we give a new and faster algorithm for 
learning MDNF over product distributions (from random examples alone). We describe all the 
applications for (M)DNF expressions. However, by using the more general Theorem 3.9 in place 
of Theorem 3.8 we immediately get that our algorithms can also be used to learn a broader set of 
concept classes which includes, for example, (monotone) majorities of terms. Previous algorithms 
for the second and third applications rely strongly on the term-combining function being an OR. 

5.1 Learning with Membership Queries 

An immediate application of Theorem 4.2 together with the bound in Theorem 3.8 and the EKM 
algorithm (Th. 2.2) is a simple algorithm for learning DNF over any constant-bounded product 
distribution. 

Corollary 5.1 Let $c \in (0, 1]$ be a constant. There exists a membership query algorithm DNFLearnMQProd 
that for every $c$-bounded $\mu$, efficiently PAC learns DNF expressions over $D_\mu$. 

Proof: Let $\epsilon' = \epsilon/9$ and, as defined in Th. 3.8, let $d = \lfloor \log(s/\epsilon') / \log(2/(2 - c)) \rfloor$ and 

$$\gamma = \epsilon'/(2(2 - c)^{d/2} s + 1) = \Omega\left((\epsilon/s)^{(1/\log(2/(2-c)) + 1)/2}\right).$$

DNFLearnMQProd consists of two phases: 

1. Collect $\gamma$-approximations to all degree-$d$ $\mu$-Fourier coefficients. In this step we run 
the EKM algorithm for $f$ with parameters $\theta = \gamma$, $\delta = 1/4$ and $\mu$ to obtain a succinctly-represented 
$\tilde{f}_\mu(B_d)$ such that $\|\hat{f}_\mu(B_d) - \tilde{f}_\mu(B_d)\|_\infty \leq \gamma$ (EKM returns the complete $\tilde{f}_\mu$ but 
we discard coefficients with degree higher than $d$). 

2. Construct a bounded $g$ with the given $\mu$-Fourier spectrum. In this step we run 
PTFapproxProd on $\tilde{f}_\mu(B_d)$ with parameters $d$, $\gamma$, $\mu$ and $\delta = 1/4$ to construct a bounded 
function $g$ such that $\|\hat{f}_\mu(B_d) - \hat{g}_\mu(B_d)\|_\infty \leq 5\gamma = 5\epsilon'/(2(2 - c)^{d/2} s + 1)$. Note that this step 
requires no access to membership queries or random examples of $f$. 

We return $\mathrm{sign}(g(x))$ as our hypothesis. Overall, if both steps are successful (which happens with 
probability at least 1/2) then, according to Theorem 3.8, 

$$\mathbf{E}_\mu[|f - g|] \leq \|\hat{f}_\mu(B_d) - \hat{g}_\mu(B_d)\|_\infty \cdot (2(2 - c)^{d/2} s + 1) + 4\epsilon' \leq 5\gamma \cdot (2(2 - c)^{d/2} s + 1) + 4\epsilon' = 9\epsilon' = \epsilon.$$

This implies $\mathbf{Pr}_\mu[f \neq \mathrm{sign}(g)] \leq \mathbf{E}_\mu[|f - g|] \leq \epsilon$. 

The running time of both phases of DNFLearnMQProd is polynomial in $n$ and $1/\gamma$, which for 
any constant $c \in (0, 1]$ is polynomial in $n \cdot s/\epsilon$. □ 
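
The parameter choices in this proof are easy to evaluate numerically. The short Python sketch below (the function name `learner_parameters` is hypothetical) computes $\epsilon'$, $d$ and $\gamma$ from $s$, $\epsilon$ and $c$ as defined above; the ratio of logarithms defining $d$ does not depend on the base of the logarithm.

```python
import math

def learner_parameters(s, eps, c):
    """Return (eps_prime, d, gamma) as set at the start of the proof of Corollary 5.1."""
    eps_prime = eps / 9
    # d = floor(log(s / eps') / log(2 / (2 - c))); the base cancels in the ratio.
    d = math.floor(math.log(s / eps_prime) / math.log(2 / (2 - c)))
    # Factor multiplying the coefficient error in the bound of Theorem 3.8.
    blowup = 2 * (2 - c) ** (d / 2) * s + 1
    gamma = eps_prime / blowup
    return eps_prime, d, gamma
```

With these values, $5\gamma \cdot (2(2 - c)^{d/2} s + 1) + 4\epsilon'$ evaluates to $\epsilon$ up to floating-point error, which matches the final accuracy calculation in the proof.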

As noted in the proof, the only part of our algorithm that uses membership queries is the phase 
that collects Fourier coefficients of logarithmic degree. This step can also be performed using weaker 
forms of access to the target function, such as extended statistical queries of Bshouty and Feldman 
(2002) or examples coming from a random walk on a hypercube (Bshouty et al., 2005). Hence our 
algorithm can be adapted to those models in a straightforward way. 



5.2 Smoothed Analysis of Learning DNF over Product Distributions 

We now describe how PTFapproxProd can be used in the context of smoothed analysis of learning 
DNF over product distributions introduced by Kalai et al. (2009b). We start with a brief description 
of the model. 



5.2.1 Learning from Smoothed Product Distributions 

Motivated by the seminal model of smoothed analysis by Spielman and Teng (2004), Kalai et al. 
(2009b) defined learning a concept class $C$ with respect to smoothed product distributions as 
follows. The model measures the complexity of a learning algorithm with respect to a product 
distribution $D_\mu$ where $\mu$ is "perturbed" randomly. More formally, $\mu$ is chosen uniformly at random 
from a cube $\bar{\mu} + [-c, c]^n$ for a $2c$-bounded $\bar{\mu}$. A learning algorithm in this model must, for every $\bar{\mu}$ 
and $f \in C$, PAC learn $f$ over $D_\mu$ with high probability over the choice of $\mu$. 

Definition 5.2 (Kalai et al., 2009b) Let $C$ be a concept class. An algorithm $A$ is said to learn 
$C$ over smoothed product distributions if for every constant $c \in (0, 1/2]$, $f \in C$, $\epsilon, \delta > 0$, and any 
$2c$-bounded $\bar{\mu}$, given access to $EX(f, D_\mu)$ for a randomly and uniformly chosen $\mu \in \bar{\mu} + [-c, c]^n$, 
with probability at least $1 - \delta$, $A$ outputs a hypothesis $h$, $\epsilon$-close to $f$ relative to $D_\mu$. The probability 
here is taken with respect to the random choice of $\mu$, choice of random samples from $D_\mu$ and any 
internal randomization of $A$. $A$ is said to learn efficiently if its running time is upper-bounded by a 
polynomial in $n/(\epsilon \cdot \delta)$ (and the size $s$ of $f$ if $C$ is parameterized) where the degree of the polynomial 
is allowed to depend on $c$. 



Feature Finding Algorithm. A key insight in the results of Kalai et al. (2009b) is that if a 
bounded function $f$ has a low-degree significant $\mu$-Fourier coefficient $\hat{f}_\mu(a)$, then after the perturbation 
$f$ will have significant $\mu$-Fourier coefficients for all $b \leq a$ (here $b \leq a$ means $b_i \leq a_i$ for all 
$i \in [n]$). This insight leads to a simple method for finding all the significant $\mu$-Fourier coefficients 
of degree $d$ in time polynomial in $2^d$ instead of $n^d$ required by the Low Degree algorithm. 
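
To see why this downward-closure property helps, consider the following simplified Python sketch. It is not the Greedy Feature Construction algorithm of Kalai et al. (2009b); it only illustrates a level-wise search that extends an index set by one variable at a time and discards sets whose coefficient estimate is small. The oracle `coeff`, which returns an estimate of the coefficient for a given index set, and the function name `downward_closed_search` are assumptions made for the example.

```python
def downward_closed_search(coeff, n, d, theta):
    """Level-wise search for index sets S with |S| <= d and |coeff(S)| >= theta.

    Assumes that every subset of a significant set is itself significant, so a
    set only needs to be examined if it extends a significant set by one index.
    """
    found = []
    frontier = [frozenset()]                 # start from the empty index set
    for _ in range(d):
        # Extend every surviving set by one new variable (duplicates are merged).
        candidates = {S | {i} for S in frontier for i in range(n) if i not in S}
        frontier = [S for S in candidates if abs(coeff(S)) >= theta]
        found.extend(frontier)
    return found
```

When few sets are significant, the search examines roughly $n$ extensions per significant set rather than all $n^d$ index sets of size at most $d$.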

Theorem 5.3 (Greedy Feature Construction (GFC) (Kalai et al., 2009b)) Let $c \in (0, 1/2]$ 
be a constant. There exists an algorithm that for every $f : \{-1, 1\}^n \to [-1, 1]$, $d \in [n]$, $\theta, \delta > 0$, 
$2c$-bounded $\bar{\mu}$, given access to $EX(f, D_\mu)$ for a randomly and uniformly chosen $\mu \in \bar{\mu} + [-c, c]^n$, 
with probability at least $1 - \delta$, outputs a succinctly-represented vector $\tilde{f}_\mu(B_d)$ such that $\|\hat{f}_\mu(B_d) - 
\tilde{f}_\mu(B_d)\|_\infty \leq \theta$ and $\|\tilde{f}_\mu(B_d)\|_0 \leq 4/\theta^2$. The algorithm runs in time $\tilde{O}((n \cdot 2^d/(\theta \cdot \delta))^{k(c)})$ for some 
constant $k(c)$ which depends only on $c$. 



5.2.2 Application of PTFapproxProd 

The Greedy Feature Construction algorithm gives an efficient algorithm for collecting $\mu$-Fourier 
coefficients of logarithmic degree. The application of PTFapproxProd in this setting is now straightforward. 
All that needs to be done is to replace the EKM algorithm in the coefficient collection 
phase of DNFLearnMQProd (Cor. 5.1) with the GFC algorithm. The coefficient collection phase of 
DNFLearnMQProd requires only coefficients of logarithmic degree in the learning parameters and 
therefore the resulting combination runs in polynomial time (the approximator construction phase 
is unchanged and still uses the EKM algorithm). Thereby we obtain a new simple proof of the 
following theorem from (Kalai et al., 2009b). 

Theorem 5.4 (Kalai et al., 2009b) DNF expressions are PAC learnable efficiently over smoothed 
product distributions. 



5.3 Learning Monotone DNF 

We now describe our algorithm for learning monotone $s$-term DNF from random examples alone. 
For simplicity, we describe it for the uniform distribution, but all the ingredients that we use have 
their product distribution versions and hence the generalization is straightforward (we describe 
it in Appendix A). As pointed out earlier, our algorithm is based on Servedio's algorithm for 
learning monotone DNF (Servedio, 2004). The main idea of his algorithm is to restrict learning to 
influential variables alone (which for a monotone function can be efficiently identified) and then run 
the Low Degree algorithm (Th. 2.3) to approximate all the Fourier coefficients of low degree on influential 
variables. The sign of the resulting low-degree polynomial $p(x)$ is then used as a hypothesis. The 
degree that is known to be sufficient for such approximation to work was derived using a Fourier 
concentration bound by Mansour (1995) and Linial et al. (1993) and equals $20 \cdot \log(s/\epsilon) \cdot \log(1/\epsilon)$. 

In our algorithm, instead of just taking the sign of $p(x)$ as the hypothesis, we use PTFapprox 
to produce a bounded function with the same Fourier coefficients as $p(x)$. The advantage of this 
approach is that the degree bound required to achieve $\epsilon$-accuracy using our approach is reduced to 
$\log(s/\epsilon) + O(1)$ (and is also significantly easier to prove than the Switching Lemma-based bound 
of Mansour (1995)). Further, neither the estimation accuracy required by our algorithm nor the 
number of sufficiently influential variables depends on $n$. As a consequence our algorithm 
is attribute-efficient. 

Following Servedio (2004), we rely on a well-known connection between the influence of a variable 
and Fourier coefficients that include that variable. Formally, for a function $f : \{-1, 1\}^n \to 
\{-1, 1\}$ and $i \in [n]$ let $f_{i,1}(x)$ and $f_{i,-1}(x)$ denote $f(x)$ with bit $i$ of the input set to 1 and $-1$, respectively. 
The influence of variable $i$ over distribution $D$ is defined as $I_{D,i}(f) = \mathbf{Pr}_D[f_{i,1}(x) \neq f_{i,-1}(x)]$. 
We use $I_i(f)$ to denote the influence over the uniform distribution. Let $S_i = \{a \in \{0, 1\}^n \mid a_i = 1\}$. 
Kahn et al. (1988) have shown that for every $i \in [n]$, 

$$I_i(f) = \sum_{a \in S_i} \hat{f}(a)^2 = \|\hat{f}(S_i)\|_2^2. \quad (7)$$

The crucial use of monotonicity is that for any monotone $f$, $I_{D,i}(f) = (\mathbf{E}_D[f_{i,1}(x)] - \mathbf{E}_D[f_{i,-1}(x)])/2$ 
and hence one can estimate $\|\hat{f}(S_i)\|_2^2$ using random uniform examples of $f$. We now describe our 
algorithm for learning monotone DNF over the uniform distribution more formally. 
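
The estimate of the influence of a monotone function from random examples can be sketched as follows. This Python sketch assumes uniform $\pm 1$ examples in which both values of every coordinate occur; the function name `estimate_influences` is hypothetical, and the sample size needed for a given accuracy is not addressed here.

```python
import numpy as np

def estimate_influences(xs, ys):
    """Estimate (E[f | x_i = 1] - E[f | x_i = -1]) / 2 for every variable i.

    xs: (m, n) array of examples with entries in {-1, 1}
    ys: (m,) array of labels f(x) in {-1, 1}, with f assumed monotone
    """
    estimates = np.empty(xs.shape[1])
    for i in range(xs.shape[1]):
        mean_pos = ys[xs[:, i] == 1].mean()    # empirical E[f_{i,1}]
        mean_neg = ys[xs[:, i] == -1].mean()   # empirical E[f_{i,-1}]
        estimates[i] = (mean_pos - mean_neg) / 2
    return estimates
```

For the monotone function $(x_1 \wedge x_2) \vee x_3$ on five variables the true influences are $1/4$, $1/4$, $3/4$, $0$ and $0$, and the estimates approach these values as the number of examples grows.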

Theorem 5.5 There exists an algorithm that PAC learns $s$-term monotone DNF expressions over 
the uniform distribution to accuracy $\epsilon$ in time $\tilde{O}(n \cdot (s \cdot \log(s/\epsilon))^{O(\log(s/\epsilon))})$. 

Proof: Our algorithm is based on the same two phases as DNFLearnMQProd in Corollary 5.1. Hence 
we set $\epsilon' = \epsilon/9$, $d = \lfloor \log(s/\epsilon') \rfloor$ and $\gamma = \epsilon'/(2s + 1)$. 

The goal of the first phase of the algorithm is to collect $\gamma$-approximations to degree-$d$ Fourier 
coefficients of $f$. We do this by first finding the influential variables and then using a low-degree 
algorithm restricted to the influential variables. 

Using equation (7), we can conclude that if for some variable $i$, $I_i(f) = \|\hat{f}(S_i)\|_2^2 \leq \gamma^2$, then 
there are no Fourier coefficients of $f$ that include variable $i$ and are greater in their magnitude 
than $\gamma$. We can therefore eliminate variable $i$, that is, approximate all of the Fourier coefficients in $S_i$ 
by 0. Also, as we mentioned before, $I_i(f)$ can be estimated from random examples of $f$. We will 
use an estimate to accuracy $\gamma^2/3$ and exclude variable $i$ if the estimate is lower than $2\gamma^2/3$ (the 
straightforward details of the required confidence bounds appear in the more detailed and general 
proof of Theorem 5.6). 

We argue that this process will eliminate all but at most $s \cdot \log(3s/\gamma^2)$ variables. This follows 
from the fact that if a variable $i$ appears only in terms of length greater than $\log(3s/\gamma^2)$ then it 
cannot be influential enough to survive the elimination condition. Over the uniform distribution, 
each term of length greater than $\log(3s/\gamma^2)$ equals 1 with probability at most $\gamma^2/(3s)$. The value 
$f_{i,1}(x)$ differs from $f_{i,-1}(x)$ only if $x$ is accepted by a term that includes variable $i$. There are at 
most $s$ terms and therefore (for a variable $i$ that appears only in terms of length greater than $\log(3s/\gamma^2)$) 

$$(\mathbf{E}[f_{i,1}(x)] - \mathbf{E}[f_{i,-1}(x)])/2 < s \cdot \gamma^2/(3s) = \gamma^2/3.$$

Consequently, the estimate of the influence of such a variable $i$, being within $\gamma^2/3$ of the true influence, 
is lower than the $2\gamma^2/3$ required to survive the elimination. Therefore at the end of the first step we will end up with variables only from 
terms of length at most $\log(3s/\gamma^2)$. Hence there will be at most $s \cdot \log(3s/\gamma^2)$ variables left. Let 
$M$ denote the set of the remaining (influential) variables. 

In the second step of this phase we run the low-degree algorithm for degree $d$ and $\theta = \gamma = 
\epsilon'/(2s + 1)$ restricted to the variables in $M$, and let $\tilde{f}(B_d)$ be the resulting vector of approximate 
Fourier coefficients (the coefficients with variables outside of $M$ are 0). By Theorem 2.3 and the 
property of our influential variables, $\|\hat{f}(B_d) - \tilde{f}(B_d)\|_\infty \leq \gamma$. 

We can now construct an approximating function in the same way as we did in DNFLearnMQProd 
(Cor. 5.1). Namely, in the third step of the algorithm we run PTFapprox on $\tilde{f}(B_d)$ to obtain a 
bounded function $g$ such that $\|\hat{f}(B_d) - \hat{g}(B_d)\|_\infty \leq 5\gamma = 5\epsilon'/(2s + 1)$. Then, by Theorem 3.8, 

$$\mathbf{E}[|f - g|] \leq (2s + 1)\|\hat{f}(B_d) - \hat{g}(B_d)\|_\infty + 4\epsilon' \leq (2s + 1) \cdot 5\epsilon'/(2s + 1) + 4\epsilon' = 9\epsilon' = \epsilon.$$

Hence $\mathbf{Pr}[\mathrm{sign}(g) \neq f] \leq \epsilon$. 

To analyze the running time of our algorithm we note that both the first and the third steps 
can be done in $\tilde{O}(n) \cdot \mathrm{poly}(s/\epsilon)$ time. According to Theorem 2.3 the second step can be done in 
$n \cdot |M|^d \cdot \mathrm{poly}(|M|/\gamma) = n \cdot (s \cdot \log(s/\epsilon))^{O(\log(s/\epsilon))}$ time steps. Altogether, we obtain the claimed 
bound on the running time. □ 

A corollary of our running time bound is that for $s$ and $\epsilon$ such that $s/\epsilon = 2^{\sqrt{\log n}}$, $s$-term monotone 
DNF are learnable to accuracy $\epsilon$ in polynomial time. Servedio's algorithm is only guaranteed 
to efficiently learn $2^{\sqrt{\log n}}$-term MDNF to constant accuracy. 

We remark that the bound on running time can be simplified for monotone $s$-term $k$-DNF 
expressions. Specifically, we will obtain an algorithm running in $(s \cdot k)^{O(k)} \cdot (n/\epsilon)^{O(1)}$ time. This 
algorithm can be used to obtain fully-polynomial learning algorithms for monotone $2^{\sqrt{\log n}}$-term 
$\sqrt{\log n}$-DNF and other subclasses of MDNF expressions for which no fully-polynomial learning 
algorithms were known. 

In Appendix A we give the straightforward generalization of our learning algorithm to product 
distributions and prove the following theorem. 

Theorem 5.6 For any constant $c \in (0, 1]$ there exists an algorithm MDNFLearnProd that PAC 
learns $s$-term monotone DNF expressions over all $c$-bounded product distributions to accuracy $\epsilon$ in 
time $\tilde{O}(n \cdot (s \cdot \log(s/\epsilon))^{O(\log(s/\epsilon))})$. 

A Proofs of Some Generalizations 

Theorem A.1 (restatement of Th. 3.9) Let $c \in (0, 1]$ be a constant, $\mu$ be a $c$-bounded distribution 
and $\epsilon > 0$. For an integer $s > 0$ let $f = h(u_1, u_2, \ldots, u_s)$, where $h$ is an LTF over $\{-1, 1\}^s$ and the $u_i$'s 
are terms. For $d = \lfloor \log(W_1(h)/\epsilon) / \log(2/(2 - c)) \rfloor$ and every bounded function $g : \{-1, 1\}^n \to [-1, 1]$, 

$$\mathbf{E}_\mu[|f(x) - g(x)|] \leq (2 \cdot (2 - c)^{d/2} + 1) \cdot W_1(h) \cdot \|\hat{f}_\mu(B_d) - \hat{g}_\mu(B_d)\|_\infty + 4\epsilon.$$

For $c = 1$, $(2 - c)^{d/2} = 1$ and for $c \in (0, 1)$, $(2 - c)^{d/2} \leq (W_1(h)/\epsilon)^{(1/\log(2/(2-c)) - 1)/2}$. 

Proof: Let $w = (w_0, w_1, \ldots, w_s)$ be the weight vector of $h$ such that the linear function $q(y) = 
\sum_{i \in [s]} w_i y_i + w_0$ 1-sign-represents $h(y)$ and $\|w\|_1 = W_1(h)$. Let $p(x) = \sum_{i \in [s]} w_i u_i(x) + w_0$. Now let 
$M \subseteq [s]$ denote the set of indices of $f$'s terms which have length $\geq d + 1 \geq \log(W_1(h)/\epsilon) / \log(2/(2 - c))$ 
and let $p'(x) = \sum_{i \notin M} w_i u_i(x) + w_0 - \sum_{i \in M} w_i$. In other words, $p'$ is $p$ with each term $u_i$ for $i \in M$ 
replaced by the constant $-1$. 

For each $i \in M$, $\mathbf{E}_\mu[|u_i(x) + 1|] = 2\mathbf{Pr}_\mu[u_i(x) = 1] \leq 2(1 - c/2)^{d+1} \leq 2\epsilon/W_1(h)$. This implies 
that 

$$\mathbf{E}_\mu[|p'(x) - p(x)|] = \mathbf{E}_\mu\left[\left|\sum_{i \in M} w_i(u_i(x) + 1)\right|\right] \leq \sum_{i \in M} |w_i| \cdot \mathbf{E}_\mu[|u_i(x) + 1|] \leq 2\epsilon. \quad (8)$$



For every $i \notin M$, let $t_i(x) = u_i(x)/2 + 1/2$ be the $\{0, 1\}$ version of term $u_i$. Lemma 3.7 implies 
that 

$$\|\hat{u}_{i,\mu}(B_d)\|_1 \leq 2\|\hat{t}_{i,\mu}(B_d)\|_1 + 1 \leq 2 \cdot (2 - c)^{d/2} + 1. \quad (9)$$



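The polynomial $p'(x)$ is of degree $d$ and, using inequality (9), we obtain 

$$\|\hat{p'}_\mu(B_d)\|_1 \leq \sum_{i \notin M} |w_i| \cdot \|\hat{u}_{i,\mu}(B_d)\|_1 + \sum_{i \in M} |w_i| + |w_0|$$
$$\leq \sum_{i \notin M} |w_i| \cdot 2 \cdot (2 - c)^{d/2} + \sum_{i \in [s]} |w_i| + |w_0| \leq W_1(h)(2 \cdot (2 - c)^{d/2} + 1). \quad (10)$$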



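We can now apply Lemma 3.4 and equations (8) and (10) to obtain 

$$\mathbf{E}_\mu[|f(x) - g(x)|] \leq \|\hat{f}_\mu(B_d) - \hat{g}_\mu(B_d)\|_\infty \cdot \|\hat{p'}_\mu(B_d)\|_1 + 2\mathbf{E}_\mu[|p'(x) - p(x)|]$$
$$\leq (2 \cdot (2 - c)^{d/2} + 1) \cdot W_1(h) \cdot \|\hat{f}_\mu(B_d) - \hat{g}_\mu(B_d)\|_\infty + 4\epsilon.$$

□ 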



Theorem A.2 (restatement of Th. 5.6) For any constant $c \in (0, 1]$ there exists an algorithm 
MDNFLearnProd that PAC learns $s$-term monotone DNF expressions over all $c$-bounded product 
distributions to accuracy $\epsilon$ in time $\tilde{O}(n \cdot (s \cdot \log(s/\epsilon))^{O(\log(s/\epsilon))})$. 

Proof: As in the proof of Theorem 5.5, MDNFLearnProd is based on two phases: in the first phase 
we collect $\mu$-Fourier coefficients of the target function $f$ using a low-degree algorithm restricted 
to influential variables; in the second phase we construct an approximating function given the 
$\mu$-Fourier spectrum. 

Let $D_\mu$ denote the target $c$-bounded distribution. The identification of influential variables is 
based on the generalization of equation (7) to product distributions by Bshouty and Tamon (1996): 
for every product distribution $D_\mu$ and $i \in [n]$, 

$$I_{D_\mu,i}(f) = 4\mu_i(1 - \mu_i) \sum_{a \in S_i} \hat{f}_\mu(a)^2 = 4\mu_i(1 - \mu_i)\|\hat{f}_\mu(S_i)\|_2^2. \quad (11)$$



As in DNFLearnMQProd, we set $\epsilon' = \epsilon/9$, $d = \lfloor \log(s/\epsilon') / \log(2/(2 - c)) \rfloor$ and $\gamma = \epsilon'/(2(2 - 
c)^{d/2} s + 1) = \Omega\left((\epsilon/s)^{(1/\log(2/(2-c)) + 1)/2}\right)$ (as defined in Th. 3.8). 

Let $c' = 4c(1 - c)$. Using equation (11), we can conclude that if for some variable $i$, $I_{D_\mu,i}(f) = 
4\mu_i(1 - \mu_i)\|\hat{f}_\mu(S_i)\|_2^2 < c'\gamma^2$, then there are no $\mu$-Fourier coefficients of $f$ that include variable $i$ and 
are greater in their magnitude than $\gamma$. We can therefore eliminate variable $i$, that is, approximate 
all of the $\mu$-Fourier coefficients in $S_i$ by 0. By definition, for a monotone $f$, $I_{D_\mu,i}(f) = (\mathbf{E}_\mu[f_{i,1}(x)] - 
\mathbf{E}_\mu[f_{i,-1}(x)])/2$ and therefore $I_{D_\mu,i}(f)$ can be estimated empirically from random examples of $f$. 
We estimate each $I_{D_\mu,i}(f)$ to accuracy $c' \cdot \gamma^2/3$ with confidence $1 - 1/(6n)$. The standard Chernoff 
bounds imply that $O(\gamma^{-4} \cdot \log n)$ examples are sufficient for this. We exclude variable $i$ if the 
obtained estimate is lower than $2c' \cdot \gamma^2/3$. 

We argue that this process will eliminate all but at most $O(s \cdot \log(s/\epsilon))$ variables. This 
follows from the fact that if a variable $i$ appears only in terms of length greater than $d' = 
\log(3s/(c' \cdot \gamma^2)) / \log(2/(2 - c))$ then it cannot be influential enough to survive the elimination 
condition. Over a $c$-bounded distribution $D_\mu$, each term of length greater than $d'$ equals 1 with probability 
less than $(1 - c/2)^{d'} = c' \cdot \gamma^2/(3s)$. The value $f_{i,1}(x)$ differs from $f_{i,-1}(x)$ only if $x$ is accepted by a 
term that includes variable $i$. There are at most $s$ terms and therefore (for a variable $i$ that appears 
only in terms of length greater than $d'$) 

$$(\mathbf{E}_\mu[f_{i,1}(x)] - \mathbf{E}_\mu[f_{i,-1}(x)])/2 < s \cdot c' \cdot \gamma^2/(3s) = c' \cdot \gamma^2/3.$$

Consequently, such a variable cannot produce an estimate within $c' \cdot \gamma^2/3$ of its influence which is at least $2c' \cdot \gamma^2/3$, 
meaning that at the end of the first step we will end up with variables only from terms of length 
at most $d' = O(\log(s/\epsilon))$. Hence there will be at most $O(s \cdot \log(s/\epsilon))$ variables left. Let $M$ denote 
the set of remaining (influential) variables. 

In the second step of MDNFLearnProd we run the low-degree algorithm for degree $d$, $\theta = \gamma$ 
and confidence 1/6 restricted to the variables in $M$, and let $\tilde{f}_\mu(B_d)$ be the resulting vector of 
approximate $\mu$-Fourier coefficients (the coefficients with variables outside of $M$ are 0). By Theorem 
2.3, with probability at least 5/6, $\|\hat{f}_\mu(B_d) - \tilde{f}_\mu(B_d)\|_\infty \leq \gamma$. 

We can now construct an approximating function in the same way as we did in DNFLearnMQProd 
(Cor. 5.1). Namely, in the third step of the algorithm we run PTFapproxProd on $\tilde{f}_\mu(B_d)$ restricted 
to the variables in $M$, to obtain, with probability at least 5/6, a bounded function $g$ such that 

$$\|\hat{f}_\mu(B_d) - \hat{g}_\mu(B_d)\|_\infty \leq 5\gamma = 5\epsilon'/(2(2 - c)^{d/2} s + 1).$$

Then, by Theorem 3.8, 

$$\mathbf{E}_\mu[|f - g|] \leq \|\hat{f}_\mu(B_d) - \hat{g}_\mu(B_d)\|_\infty \cdot (2(2 - c)^{d/2} s + 1) + 4\epsilon' \leq 5\gamma \cdot (2(2 - c)^{d/2} s + 1) + 4\epsilon' = 9\epsilon' = \epsilon.$$

Hence, with probability at least 1/2, we will output $g$ such that $\mathbf{Pr}_\mu[\mathrm{sign}(g) \neq f] \leq \epsilon$. 
To analyze the running time of our algorithm, we note that for a fixed constant $c$, 

$$1/\gamma = O\left((s/\epsilon)^{(1/\log(2/(2-c)) + 1)/2}\right) = \mathrm{poly}(s/\epsilon).$$



The first step of the algorithm takes $\tilde{O}(n\gamma^{-4})$ time. According to Theorem 2.3 the second step can 
be done in $n \cdot |M|^d \cdot \mathrm{poly}(|M|/\gamma) = n \cdot (s \log(s/\epsilon))^{O(\log(s/\epsilon))}$ time steps (the factor $n$ comes from the 
fact that obtaining an individual random example and restricting it to the influential variables takes 
$O(n)$ time steps). According to Corollary 5.1 the third step can be done in $n \cdot \mathrm{poly}(|M|, 1/\gamma) = 
n \cdot \mathrm{poly}(s/\epsilon)$ time steps. Altogether, we obtain the claimed bound on the running time. □ 